Efficient Discovery of Association Rules and Frequent Itemsets through 
Sampling with Tight Performance Guarantees*†

Matteo Riondato‡ and Eli Upfal 
Department of Computer Science, Brown University, Providence, RI, USA 

{matteo, eli}@cs.brown.edu 

June 22, 2012 



Abstract 

The tasks of extracting (top-K) Frequent Itemsets (FI's) and Association Rules (AR's) are fundamental primitives in data mining 
and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the 
need to scan the entire dataset, possibly multiple times. High quality approximations of FI's and AR's are sufficient for most 
practical uses, and a number of recent works explored the application of sampling for fast discovery of approximate solutions to the 
problems. However, these works do not provide satisfactory performance guarantees on the quality of the approximation, due to the 
difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. In this work 
we circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique 
for providing tight bounds on the sample size that guarantees approximation within user-specified parameters. Our technique applies 
both to absolute and to relative approximations of (top-K) FI's and AR's. The resulting sample size is linearly dependent on the 
VC-dimension of a range space associated with the dataset to be mined. The main theoretical contribution of this work is a characterization 
of the VC-dimension of this range space and a proof that it is upper bounded by an easy-to-compute characteristic quantity of the 
dataset which we call d-index, namely the maximum integer d such that the dataset contains at least d transactions of length at least 
d. We show that this bound is tight for a large class of datasets. The resulting sample size for an absolute (resp. relative) 
$(\varepsilon, \delta)$-approximation of the collection of FI's is $O\left(\frac{1}{\varepsilon^2}\left(d + \log\frac{1}{\delta}\right)\right)$ 
(resp. $O\left(\frac{2+\varepsilon}{\varepsilon^2(2-\varepsilon)\theta}\left(d\log\frac{2+\varepsilon}{\theta(2-\varepsilon)} + \log\frac{1}{\delta}\right)\right)$) transactions, which is a 
significant improvement over previously known results. We present an extensive experimental evaluation of our technique on real and 
artificial datasets, demonstrating the practicality of our methods, and showing that they achieve even higher quality approximations 
than what is guaranteed by the analysis. 

1 Introduction 

Discovery of frequent itemsets and association rules is a fundamental computational primitive with application in data mining (market 
basket analysis), databases (histogram construction), networking (heavy hitters) and more [15, Sect. 5]. Depending on the particular 
application, one is interested in finding all itemsets with frequency greater than or equal to a user-defined threshold (FI's), identifying 
the K most frequent itemsets (top-K), or computing all association rules (AR's) with user-defined minimum support and confidence 
level. Exact solutions to these problems require scanning the entire dataset, possibly multiple times. For large datasets that do not 
fit in main memory, this can be prohibitively expensive. Furthermore, such extensive computation is often unnecessary, since high 
quality approximations are sufficient for most practical applications. Indeed, a number of recent papers 
[4, 6, 7, 9, 10, 12, 13, 17–22, 26, 28, 30–34, 37, 39–42] explored the application of sampling for approximate solutions to these problems. However, the efficiency 
and practicality of the sampling approach depend on a tight relation between the size of the sample and the quality of the resulting 
approximation. Previous works do not provide satisfactory solutions to this problem. 

The technical difficulty in analyzing any sampling technique for frequent itemsets discovery problems is that a priori any subset 
of items can be among the most frequent ones, and the number of subsets is exponential in the number of distinct items appearing in 
the dataset. A standard analysis begins with a bound on the probability that a given itemset is either over- or under-represented in the 
sample. Such a bound is easy to obtain using a Chernoff-like bound or the Central Limit theorem. The difficulty is in combining the 
bounds for individual itemsets into a global bound that holds simultaneously for all the itemsets. A simple application of the union 
bound vastly overestimates the error probability because of the large number of possible itemsets, a large fraction of which may not be 
present in the dataset and therefore should not be considered. More sophisticated techniques, developed in recent works [6, 12, 31], 
give better bounds only in limited cases. A loose bound on the required sample size for achieving the user-defined performance 
guarantees decreases the gain obtained from the use of sampling. 

* Work was supported in part by NSF award IIS-0905553. 
† A shorter version of this paper is scheduled to appear in the proceedings of ECML PKDD 2012. 
‡ Contact author. 

In this work we circumvent this problem through a novel application of the Vapnik-Chervonenkis (VC) dimension concept, a 
fundamental tool in statistical learning theory. Roughly speaking, the VC-dimension of a collection of indicator functions (a range 
space) is a measure of its complexity or expressiveness (see Sect. 2.2 for formal definitions). A major result [36] relates the 
VC-dimension of a range space to the sufficient size for a random sample to simultaneously approximate all the indicator functions within 
predefined parameters. The main obstacle in applying the VC-dimension theory to particular computation problems is computing the 
VC-dimension of the range spaces associated with these problems. 

We apply the VC-dimension theory to frequent itemsets problems by viewing the presence of an itemset in a transaction as 
the outcome of an indicator function associated with the itemset. The major theoretical contributions of our work are a complete 
characterization of the VC-dimension of the range space associated with a dataset, and a tight bound on this quantity. We prove that 
the VC-dimension is upper bounded by an easy-to-compute characteristic quantity of the dataset which we call d-index, namely, the 
maximum integer d such that the dataset contains at least d transactions of length at least d. We show that this bound is tight by 
demonstrating a large class of datasets with a VC-dimension that matches the bound. 

The VC-dimension approach provides a unified tool for analyzing the various frequent itemsets and association rules problems 
(i.e., the market basket analysis tasks). We use it to prove tight bounds on the required sample size for extracting FI's with a minimum 
frequency threshold, for mining the top-K FI's, and for computing the collection of AR's with minimum frequency and confidence 
thresholds. Furthermore, we compute bounds for both absolute and relative approximations (see Sect. 2.1 for definitions). We show 
that high quality approximations can be obtained by mining a very small random sample of the dataset. For example, the required 
sample size for an absolute $(\varepsilon, \delta)$-approximation of the collection of FI's is $O\left(\frac{1}{\varepsilon^2}\left(d + \log\frac{1}{\delta}\right)\right)$ transactions, which is a significant 
improvement over previously known results, as it is smaller and, more importantly, less dependent on parameters such as the minimum 
frequency threshold and the dataset size. Similar results are proven for the top-K FI's and AR's tasks. 

We present an extensive experimental evaluation of our method using real and artificial datasets, to assess the practicality of our 
approach. The experimental results show that indeed our method achieves, and even exceeds, the analytically proven guarantees for 
the quality of the approximations. 



1.1 Previous Work 

Agrawal et al. [1] introduced the problem of mining association rules in the basket data model, formalizing a fundamental task of 
information extraction in large datasets. Almost any known algorithm for the problem starts by solving a FI's problem and then 
generates the association rules implied by these frequent itemsets. Agrawal and Srikant [2] presented Apriori, the most well-known 
algorithm for mining FI's, and FastGenRules for computing association rules from a set of itemsets. Various ideas for improving the 
efficiency of FI's and AR's algorithms have been studied, and we refer the reader to the survey by Ceglar and Roddick [5] for a good 
presentation of recent contributions. However, the running times of all known algorithms heavily depend on the size of the dataset. 

Mannila et al. [28] first suggested the idea that sampling can be used to efficiently obtain the collection of FI's, presenting some 
empirical results to validate the intuition. Toivonen [34] presents an algorithm that, by mining a random sample of the dataset, builds 
a candidate set of frequent itemsets which contains all the frequent itemsets with a probability that depends on the sample size. There 
are no guarantees that all itemsets in the candidate set are frequent, but the set of candidates can be used to efficiently identify the 
set of frequent itemsets with at most two passes over the entire dataset. The work also suggests a bound on the sample size sufficient 
to ensure that the frequencies of itemsets in the sample are close to their real ones. The analysis uses Chernoff bounds and the union 
bound. The major drawback of this sample size is that it depends linearly on the number of individual items appearing in the dataset. 

Zaki et al. [40] show that static sampling is an efficient way to mine a dataset, but choosing the sample size using Chernoff bounds 
is too conservative, in the sense that it is possible to obtain the same accuracy and confidence in the approximate results at smaller 
sizes than what the theoretical analysis suggested. 

Other works tried to improve the bound on the sample size by using different techniques from statistics and probability theory like 
the central limit theorem [19, 22, 41] or hybrid Chernoff bounds [42]. 

Since theoretically-derived bounds on the sample size were too loose to be useful, a corpus of works applied progressive sampling 
to extract FI's [4, 7, 9, 10, 12, 13, 18, 20, 21, 26, 30, 39]. Progressive sampling algorithms work by selecting a random sample and 
then trimming or enriching it by removing or adding new sampled transactions according to a heuristic or a self-similarity measure 
that is fast to evaluate, until a suitable stopping condition is satisfied. The major downside of this approach is that it offers no 
guarantees on the quality of the obtained results. 

Another approach to estimating the required sample size is presented in ifTSll . The authors give an algorithm that studies the 
distribution of frequencies of the itemsets and uses this information to fix a sample size for mining frequent itemsets, but without 
offering any theoretical guarantee. 

A recent work by Chakaravarthy et al. [6] gives the first analytical bound on a sample size that is linear in the length of the longest 
transaction, rather than in the number of items in the dataset. This work is also the first to present an algorithm that uses a random 
sample of the dataset to mine approximate solutions to the AR's problem with quality guarantees. No experimental evaluation of 
their methods is presented, and they do not address the top-K FI's problem. Our approach gives better bounds for the problems 
studied in [6] and applies to related problems such as the discovery of top-K FI's and absolute approximations. 

Extracting the collection of top-K frequent itemsets is a more difficult task since the corresponding minimum frequency threshold 
is not known in advance [11, 14]. Some works solved the problem by looking at closed top-K frequent itemsets, a concise 
representation of the collection [32, 38], but they suffer from the same scalability problems as the algorithms for exactly mining FI's with a 
fixed minimum frequency threshold. 

Previous works that used sampling to approximate the collection of top-K FI's [31, 33] used progressive sampling. Both works 
provide (similar) theoretical guarantees on the quality of the approximation. What is more interesting to us, both works present a 
theoretical upper bound on the sample size needed to compute such an approximation. The size depends linearly on the number 
of items. In contrast, our results give a sample size that only in the worst case is linear in the number of items but can be (and is, 
in practical cases) much less than that, depending on the dataset, a flexibility not provided by previous contributions. Sampling is 
used by Vasudevan and Vojnović [37] to extract an approximation of the top-K frequent individual items from a sequence of items, 
which contains no item whose actual frequency is less than $f_K - \varepsilon$ for a fixed $0 < \varepsilon < 1$, where $f_K$ is the actual frequency of the 
K-th most frequent item. They derive a sample size sufficient to achieve this result, but they assume the knowledge of $f_K$, which 
is rarely the case. An empirical sequential method can be used to estimate the right sample size. Moreover, the results cannot be 
directly extended to the mining of top-K frequent item(set)s from datasets of transactions with length greater than one. 

1.2 Our Contributions 

By applying tools from statistical learning theory, we develop a general technique for bounding the sample size required for generating 
high quality approximations to frequent itemsets and association rules tasks. Table 1 compares our technique to the best previously 
known results for the various problems (see Sect. 2.1 for definitions). Our bounds, which are linear in the VC-dimension associated 
with the dataset, are consistently smaller and less dependent on other parameters of the problem than previous results. An extensive 
experimental evaluation demonstrates the advantage of our technique in practice. 



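Task  | Approximation | Best previous work 
FI's  | absolute | $O\left(\frac{1}{\varepsilon^2}\left(|\mathcal{I}| + \log\frac{1}{\delta}\right)\right)$ [19, 22, 34, 41] 
FI's  | relative | $\frac{24}{\varepsilon^2(1-\varepsilon)\theta}\left(\Delta + 5 + \log\frac{4}{(1-\varepsilon)\theta\delta}\right)$ [6] 
top-K | absolute | $O\left(\frac{1}{\varepsilon^2}\left(|\mathcal{I}| + \log\frac{1}{\delta}\right)\right)$ [31, 33] 
top-K | relative | not available 
AR's  | absolute | not available 
AR's  | relative | $\frac{48}{\varepsilon^2(1-\varepsilon)\theta}\left(\Delta + 5 + \log\frac{4}{(1-\varepsilon)\theta\delta}\right)$ [6] 

Table 1: Required sample sizes (as number of transactions) as a function of the VC-dimension $d$, the maximum transaction size $\Delta$, the number of 
items $|\mathcal{I}|$, the accuracy $\varepsilon$, the failure probability $\delta$, the minimum frequency $\theta$, and the minimum confidence $\gamma$. Note that $d \le \Delta \le |\mathcal{I}|$. 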

To the best of our knowledge, our work is the first to provide a characterization and an explicit bound for the VC-dimension of 
the range space associated with a dataset and to apply the result to the extraction of FI's and AR's from a random sample of the dataset. 
We believe that this connection with statistical learning theory can be further exploited in other data mining problems. 

We also believe that our approach can be applied not just to mining collections of frequent itemsets and association rules, which 
can be massive, but also to the mining of small collections of itemsets/association rules that describe the dataset with the minimal 
number of itemsets/association rules possible, as presented in [27]. 

Outline. In Sect. 2 we formally define the problem and our goals, and introduce definitions and lemmas used in the analysis. The 
main part of the analysis, with the derivation of a tight bound on the VC-dimension of association rules, is presented in Sect. 3, while our 
algorithms and sample sizes for mining FI's, top-K FI's, and association rules through sampling are in Sect. 4. Section 5 contains an 
extensive experimental evaluation of our techniques. 

2 Preliminaries 

We now introduce basic definitions and lemmas we will use in later sections. 



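Table 1, column "This work" (sample sizes by task and approximation type): 

Task  | Approximation | This work 
FI's  | absolute | $\frac{4c}{\varepsilon^2}\left(d + \log\frac{1}{\delta}\right)$ 
FI's  | relative | $\frac{4(2+\varepsilon)c}{\varepsilon^2(2-\varepsilon)\theta}\left(d\log\frac{2+\varepsilon}{\theta(2-\varepsilon)} + \log\frac{1}{\delta}\right)$ 
top-K | absolute | $\frac{16c}{\varepsilon^2}\left(d + \log\frac{1}{\delta}\right)$ 
top-K | relative | $\frac{4(2+\varepsilon)c}{\varepsilon^2(2-\varepsilon)\theta}\left(d\log\frac{2+\varepsilon}{\theta(2-\varepsilon)} + \log\frac{1}{\delta}\right)$ 
AR's  | absolute | $O\left(\frac{1+\varepsilon}{\varepsilon^2(1-\varepsilon)\theta}\left(d\log\frac{1+\varepsilon}{\theta(1-\varepsilon)} + \log\frac{1}{\delta}\right)\right)$ 
AR's  | relative | $\frac{16c(4+\varepsilon)}{\varepsilon^2(4-\varepsilon)\theta}\left(d\log\frac{4+\varepsilon}{\theta(4-\varepsilon)} + \log\frac{1}{\delta}\right)$ 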
2.1 Datasets, Itemsets, and Association Rules 

A dataset $\mathcal{D}$ is a collection of transactions, where each transaction $\tau$ is a subset of a ground set $\mathcal{I}$. There can be multiple identical 
transactions in $\mathcal{D}$. Members of $\mathcal{I}$ are called items and members of $2^{\mathcal{I}}$ are called itemsets. Let $|\tau|$ denote the number of items 
in transaction $\tau$. Given an itemset $A \in 2^{\mathcal{I}}$, let $T_{\mathcal{D}}(A)$ denote the set of transactions in $\mathcal{D}$ that contain $A$. The support of $A$, 
$\sigma_{\mathcal{D}}(A) = |T_{\mathcal{D}}(A)|$, is the number of transactions in $\mathcal{D}$ that contain $A$, and the frequency of $A$, $f_{\mathcal{D}}(A) = \frac{|T_{\mathcal{D}}(A)|}{|\mathcal{D}|}$, is the fraction of 
transactions in $\mathcal{D}$ that contain $A$. 
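
The definitions of support and frequency can be illustrated with a minimal Python sketch. The function names `support` and `frequency` and the toy dataset are assumptions made for this example only; they are not part of the paper.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]


def support(dataset: List[Transaction], itemset: Iterable[str]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for t in dataset if wanted <= t)


def frequency(dataset: List[Transaction], itemset: Iterable[str]) -> float:
    """Fraction of transactions that contain `itemset`."""
    return support(dataset, itemset) / len(dataset)


# Hypothetical toy dataset; identical transactions may appear more than once.
toy = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "b"], ["c"])]
```

With this toy dataset, the itemset {a, b} is contained in three of the four transactions.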

Definition 1. Given a minimum frequency threshold $\theta$, $0 < \theta \le 1$, the FI's mining task with respect to $\theta$ is finding all itemsets with 
frequency $\ge \theta$, i.e., the set 

$\mathrm{FI}(\mathcal{D}, \mathcal{I}, \theta) = \{(A, f_{\mathcal{D}}(A)) : A \in 2^{\mathcal{I}} \text{ and } f_{\mathcal{D}}(A) \ge \theta\}$. 

To define the collection of top-K FI's, we assume a fixed canonical ordering of the itemsets in $2^{\mathcal{I}}$ by decreasing frequency in 
$\mathcal{D}$, with ties broken arbitrarily, and label the itemsets $A_1, A_2, \dots, A_m$ according to this ordering. For a given $K$, with $1 \le K \le m$, 
we denote with $f_{\mathcal{D}}^{(K)}$ the frequency $f_{\mathcal{D}}(A_K)$ of the K-th most frequent itemset $A_K$, and define the set of top-K FI's (with their 
respective frequencies) as 

$\mathrm{TOPK}(\mathcal{D}, \mathcal{I}, K) = \mathrm{FI}(\mathcal{D}, \mathcal{I}, f_{\mathcal{D}}^{(K)})$. 
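
As an illustration of Definition 1 and of the top-K definition, the following brute-force Python sketch enumerates itemsets on a tiny dataset. It is only meant to make the definitions concrete and is not one of the mining algorithms discussed in the paper. The names `all_frequencies`, `frequent_itemsets` and `top_k` are hypothetical, and the sketch assumes that the empty itemset and itemsets of frequency zero are ignored.

```python
from itertools import combinations


def all_frequencies(dataset):
    """Map every non-empty itemset occurring in some transaction to its frequency."""
    counts = {}
    for t in dataset:
        items = sorted(t)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    n = len(dataset)
    return {a: c / n for a, c in counts.items()}


def frequent_itemsets(dataset, theta):
    """FI(D, I, theta): itemsets with frequency at least theta."""
    return {a: f for a, f in all_frequencies(dataset).items() if f >= theta}


def top_k(dataset, k):
    """TOPK(D, I, K) = FI(D, I, f^(K)); ties at the K-th frequency are all kept."""
    freqs = sorted(all_frequencies(dataset).values(), reverse=True)
    return frequent_itemsets(dataset, freqs[k - 1])


# Hypothetical toy dataset.
toy = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "b"], ["c"])]
```

Because TOPK is defined through the threshold $f_{\mathcal{D}}^{(K)}$, the returned collection can contain more than K itemsets when several itemsets share the K-th frequency.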
One of the main uses of frequent itemsets is in the discovery of association rules. 

Definition 2. An association rule $W$ is an expression "$A \Rightarrow B$" where $A$ and $B$ are itemsets such that $A \cap B = \emptyset$. The support 
$\sigma_{\mathcal{D}}(W)$ (resp. frequency $f_{\mathcal{D}}(W)$) of the association rule $W$ is the support (resp. frequency) of the itemset $A \cup B$. The confidence 
$c_{\mathcal{D}}(W)$ of $W$ is the ratio $\frac{f_{\mathcal{D}}(A \cup B)}{f_{\mathcal{D}}(A)}$ of the frequency of $A \cup B$ to the frequency of $A$. 

Intuitively, an association rule "$A \Rightarrow B$" expresses, through its support and confidence, how likely it is for the itemset $B$ to 
appear in the same transactions as itemset $A$, so that when $A$ is found in a transaction it is then possible to infer that $B$ will be present 
in the same transaction with a probability equal to the confidence of the association rule. 

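Definition 3. Given a dataset $\mathcal{D}$ with transactions built on a ground set $\mathcal{I}$, and given a minimum frequency threshold $\theta$ and a minimum 
confidence threshold $\gamma$, the AR's task with respect to $\theta$ and $\gamma$ consists in finding the set 

$\mathrm{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma) = \{(W, f_{\mathcal{D}}(W), c_{\mathcal{D}}(W)) \mid \text{association rule } W,\ f_{\mathcal{D}}(W) \ge \theta,\ c_{\mathcal{D}}(W) \ge \gamma\}$. 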
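
The frequency and confidence of a single rule, as given in Definition 2, can be computed directly. The sketch below is a hypothetical helper (`rule_stats` is not a name from the paper). It assumes that the confidence is reported as 0.0 when the antecedent never occurs, a case the definition leaves undefined.

```python
def frequency(dataset, itemset):
    """Fraction of transactions containing every item of `itemset`."""
    wanted = frozenset(itemset)
    return sum(1 for t in dataset if wanted <= t) / len(dataset)


def rule_stats(dataset, antecedent, consequent):
    """Return (frequency, confidence) of the rule antecedent => consequent."""
    a, b = frozenset(antecedent), frozenset(consequent)
    if a & b:
        raise ValueError("antecedent and consequent must be disjoint")
    f_rule = frequency(dataset, a | b)   # frequency of A union B
    f_a = frequency(dataset, a)          # frequency of A alone
    # Assumption: confidence is set to 0.0 when A never occurs.
    confidence = f_rule / f_a if f_a > 0 else 0.0
    return f_rule, confidence


# Hypothetical toy dataset.
toy = [frozenset(t) for t in (["a", "b", "c"], ["a", "b"], ["a", "b"], ["c"])]
```

A rule belongs to $\mathrm{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$ when both returned values reach the respective thresholds.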

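Often, with an abuse of notation, we will say that an itemset $A$ (resp. an association rule $W$) is in $\mathrm{FI}(\mathcal{D}, \mathcal{I}, \theta)$ or in 
$\mathrm{TOPK}(\mathcal{D}, \mathcal{I}, K)$ (resp. in $\mathrm{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$) and denote this fact with $A \in \mathrm{FI}(\mathcal{D}, \mathcal{I}, \theta)$ or $A \in \mathrm{TOPK}(\mathcal{D}, \mathcal{I}, K)$ (resp. $W \in 
\mathrm{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$), meaning that there is a pair $(A, f) \in \mathrm{FI}(\mathcal{D}, \mathcal{I}, \theta)$ or $(A, f) \in \mathrm{TOPK}(\mathcal{D}, \mathcal{I}, K)$ (resp. a triplet $(W, f_W, c_W) \in 
\mathrm{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$). 

In this work we are interested in extracting absolute and relative approximations of the sets $\mathrm{FI}(\mathcal{D}, \mathcal{I}, \theta)$, $\mathrm{TOPK}(\mathcal{D}, \mathcal{I}, K)$ and 
$\mathrm{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$. 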

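Definition 4. Given a parameter $\varepsilon_{\mathrm{abs}}$ (resp. $\varepsilon_{\mathrm{rel}}$), an absolute $\varepsilon_{\mathrm{abs}}$-close approximation (resp. a relative $\varepsilon_{\mathrm{rel}}$-close approximation) of 
$\mathrm{FI}(\mathcal{D}, \mathcal{I}, \theta)$ is a set $\mathcal{C} = \{(A, f_A) : A \in 2^{\mathcal{I}}, f_A \in [0, 1]\}$ of pairs $(A, f_A)$ where $f_A$ approximates $f_{\mathcal{D}}(A)$. $\mathcal{C}$ is such that: 

1. $\mathcal{C}$ contains all itemsets appearing in $\mathrm{FI}(\mathcal{D}, \mathcal{I}, \theta)$; 

2. $\mathcal{C}$ contains no itemset $A$ with frequency $f_{\mathcal{D}}(A) < \theta - \varepsilon_{\mathrm{abs}}$ (resp. $f_{\mathcal{D}}(A) < (1 - \varepsilon_{\mathrm{rel}})\theta$); 

3. For every pair $(A, f_A) \in \mathcal{C}$, it holds $|f_{\mathcal{D}}(A) - f_A| \le \varepsilon_{\mathrm{abs}}$ (resp. $|f_{\mathcal{D}}(A) - f_A| \le \varepsilon_{\mathrm{rel}} f_{\mathcal{D}}(A)$). 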
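
The three properties of Definition 4 can be checked mechanically when the exact frequencies are available. The following Python sketch covers the absolute case only. The function name and the dictionary-based representation are assumptions made for illustration, and itemsets missing from `true_freq` are assumed to have frequency 0.

```python
def is_absolute_eps_close(true_freq, candidate, theta, eps):
    """Check the three properties of an absolute eps-close approximation.

    true_freq: dict mapping itemset -> exact frequency in the dataset
               (itemsets not listed are assumed to have frequency 0.0).
    candidate: dict mapping itemset -> estimated frequency.
    """
    # Property 1: every truly frequent itemset is reported.
    for itemset, f in true_freq.items():
        if f >= theta and itemset not in candidate:
            return False
    for itemset, estimate in candidate.items():
        f = true_freq.get(itemset, 0.0)
        # Property 2: no reported itemset has frequency below theta - eps.
        if f < theta - eps:
            return False
        # Property 3: every estimate is within eps of the exact frequency.
        if abs(f - estimate) > eps:
            return False
    return True
```

Such a check requires a full pass over the dataset to obtain `true_freq`, so it is useful for evaluating a sample-based result rather than for producing one.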

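This definition extends easily to the case of top-K frequent itemsets mining using the equivalence 

$\mathrm{TOPK}(\mathcal{D}, \mathcal{I}, K) = \mathrm{FI}(\mathcal{D}, \mathcal{I}, f_{\mathcal{D}}^{(K)})$: 

an absolute (resp. relative) $\varepsilon$-close approximation to $\mathrm{FI}(\mathcal{D}, \mathcal{I}, f_{\mathcal{D}}^{(K)})$ is an absolute (resp. relative) $\varepsilon$-close approximation to 
$\mathrm{TOPK}(\mathcal{D}, \mathcal{I}, K)$. 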

For the case of association rules, we have the following definition. 

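Definition 5. Given a parameter $\varepsilon_{\mathrm{abs}}$ (resp. $\varepsilon_{\mathrm{rel}}$), an absolute $\varepsilon_{\mathrm{abs}}$-close approximation (resp. a relative $\varepsilon_{\mathrm{rel}}$-close approximation) of 
$\mathrm{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$ is a set 

$\mathcal{C} = \{(W, f_W, c_W) : \text{association rule } W,\ f_W \in [0, 1],\ c_W \in [0, 1]\}$ 

of triplets $(W, f_W, c_W)$ where $f_W$ and $c_W$ approximate $f_{\mathcal{D}}(W)$ and $c_{\mathcal{D}}(W)$ respectively. $\mathcal{C}$ is such that: 

1. $\mathcal{C}$ contains all association rules appearing in $\mathrm{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$; 

2. $\mathcal{C}$ contains no association rule $W$ with frequency $f_{\mathcal{D}}(W) < \theta - \varepsilon_{\mathrm{abs}}$ (resp. $f_{\mathcal{D}}(W) < (1 - \varepsilon_{\mathrm{rel}})\theta$); 

3. For every triplet $(W, f_W, c_W) \in \mathcal{C}$, it holds $|f_{\mathcal{D}}(W) - f_W| \le \varepsilon_{\mathrm{abs}}$ (resp. $|f_{\mathcal{D}}(W) - f_W| \le \varepsilon_{\mathrm{rel}} f_{\mathcal{D}}(W)$); 

4. $\mathcal{C}$ contains no association rule $W$ with confidence $c_{\mathcal{D}}(W) < \gamma - \varepsilon_{\mathrm{abs}}$ (resp. $c_{\mathcal{D}}(W) < (1 - \varepsilon_{\mathrm{rel}})\gamma$); 

5. For every triplet $(W, f_W, c_W) \in \mathcal{C}$, it holds $|c_{\mathcal{D}}(W) - c_W| \le \varepsilon_{\mathrm{abs}}$ (resp. $|c_{\mathcal{D}}(W) - c_W| \le \varepsilon_{\mathrm{rel}} c_{\mathcal{D}}(W)$). 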

Note that the definition of relative $\varepsilon$-close approximation to $\mathrm{FI}(\mathcal{D}, \mathcal{I}, \theta)$ (resp. to $\mathrm{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$) is more stringent than the 
definition of $\varepsilon$-close solution to frequent itemset mining (resp. association rule mining) in [6, Sect. 3]. Specifically, we require an 
approximation of the frequencies (and confidences) in addition to the approximation of the collection of itemsets or association rules 
(property 3 in Def. 4 and properties 3 and 5 in Def. 5). 



2.2 VC-Dimension 

The Vapnik-Chervonenkis (VC) dimension of a space of points is a measure of the complexity or expressiveness of a family of 
indicator functions (or equivalently a family of subsets) defined on that space [36]. A finite bound on the VC-dimension of a structure 
implies a bound on the number of random samples required for approximately learning that structure. We outline here some basic 
definitions and results and refer the reader to the works of Alon and Spencer [3, Sect. 14.4], Chazelle [8, Chap. 4], and Vapnik [35] 
for more details on VC-dimension. 

VC-dimension is defined on range spaces: 

Definition 6. A range space is a pair $(X, R)$ where $X$ is a (finite or infinite) set and $R$ is a (finite or infinite) family of subsets of $X$. 
The members of $X$ are called points and those of $R$ are called ranges. 

To define the VC-dimension of a range space we consider the projection of the ranges into a set of points: 

Definition 7. Let $(X, R)$ be a range space and $A \subset X$. The projection of $R$ on $A$ is defined as $P_R(A) = \{r \cap A : r \in R\}$. 

The definition of shattered set will be heavily used in our proofs: 

Definition 8. Let $(X, R)$ be a range space and $A \subset X$. If $P_R(A) = 2^A$, then $A$ is said to be shattered by $R$. 

The VC-dimension of a range space is the cardinality of the largest set shattered by the space: 

Definition 9. Let $S = (X, R)$ be a range space. The Vapnik-Chervonenkis dimension (or VC-dimension) of $S$, denoted as $\mathrm{VC}(S)$, is 
the maximum cardinality of a shattered subset of $X$. If there are arbitrarily large shattered subsets, then $\mathrm{VC}(S) = \infty$. 
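
Definitions 7 to 9 can be made concrete with a brute-force Python sketch for small finite range spaces. The function names are hypothetical. The search is exponential in the number of points, so this is an illustration of the definitions and not a practical algorithm.

```python
from itertools import combinations


def projection(ranges, a):
    """P_R(A): the distinct intersections of the ranges with the point set A."""
    a = frozenset(a)
    return {frozenset(r) & a for r in ranges}


def is_shattered(ranges, a):
    """A is shattered when the projection contains all 2^|A| subsets of A."""
    return len(projection(ranges, a)) == 2 ** len(frozenset(a))


def vc_dimension(points, ranges):
    """Largest size of a shattered subset of `points` (finite case only)."""
    points = list(points)
    best = 0
    for size in range(1, len(points) + 1):
        if any(is_shattered(ranges, c) for c in combinations(points, size)):
            best = size
        else:
            # Every subset of a shattered set is shattered, so if no set of
            # this size is shattered, no larger set can be.
            break
    return best
```

For example, for three points on a line with ranges given by all contiguous intervals, every pair of points is shattered but the full triple is not, because no interval selects the two outer points without the middle one.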

The main application of VC-dimension in statistics and learning theory is its relation to the size of the sample needed to 
approximately learn the ranges, in the following sense. 

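Definition 10. Let $(X, R)$ be a range space and let $A$ be a finite subset of $X$. 

1. For $0 < \varepsilon < 1$, a subset $B \subset A$ is an $\varepsilon$-approximation for $A$ if for all $r \in R$, we have 

$\left| \frac{|A \cap r|}{|A|} - \frac{|B \cap r|}{|B|} \right| \le \varepsilon$. (1) 

2. For $0 < p, \varepsilon < 1$, a subset $B \subset A$ is a relative $(p, \varepsilon)$-approximation for $A$ if for any range $r \in R$ such that $\frac{|A \cap r|}{|A|} \ge p$ we have 

$\left| \frac{|A \cap r|}{|A|} - \frac{|B \cap r|}{|B|} \right| \le \varepsilon \frac{|A \cap r|}{|A|}$ 

and for any range $r \in R$ such that $\frac{|A \cap r|}{|A|} < p$ we have $\frac{|B \cap r|}{|B|} \le (1 + \varepsilon)p$. 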

An $\varepsilon$-approximation (resp. a relative $(p, \varepsilon)$-approximation) can be constructed by random sampling points of the point space [16, 
Thm. 2.12 (resp. 2.11)] (see also [23]). 



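Theorem 1. There is an absolute positive constant $c$ (resp. $c'$) such that if $(X, R)$ is a range space of VC-dimension at most $d$, 
$A \subset X$ is a finite subset and $0 < \varepsilon, \delta < 1$ (resp. and $0 < p < 1$), then a random subset $B \subset A$ of cardinality $m$, where 

$m \ge \min\left\{|A|, \frac{c}{\varepsilon^2}\left(d + \log\frac{1}{\delta}\right)\right\}$, (2) 

(resp. $m \ge \min\left\{|A|, \frac{c'}{\varepsilon^2 p}\left(d\log\frac{1}{p} + \log\frac{1}{\delta}\right)\right\}$) is an $\varepsilon$-approximation (resp. a relative $(p, \varepsilon)$-approximation) for $A$ with probability 
at least $1 - \delta$. 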

Note that throughout the work we assume the sample to be drawn with replacement if $m < |A|$ (otherwise the sample is exactly 
the set $A$). Löffler and Phillips [25] showed experimentally that the absolute constant $c$ is approximately 0.5. It is also interesting to 
note that an $\varepsilon$-approximation of size $O\left(\frac{d}{\varepsilon^2}\log\frac{d}{\varepsilon}\right)$ can be built deterministically in time $O\left(d^{3d}\left(\frac{1}{\varepsilon^2}\log\frac{d}{\varepsilon}\right)^d |X|\right)$ [8]. 
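
The sample sizes of Theorem 1 are simple to evaluate numerically. The sketch below is a hedged illustration: it assumes natural logarithms, uses the experimentally suggested value 0.5 as a default for the constant $c$, and requires the caller to supply $c'$ because no value for it is given in the text. The function names are hypothetical.

```python
import math


def absolute_sample_size(d, eps, delta, population, c=0.5):
    """Sample size from (2) for an eps-approximation.

    Assumptions: natural logarithm, and c = 0.5 as a default, which is an
    experimental estimate rather than a proven constant.
    """
    bound = math.ceil(c / eps ** 2 * (d + math.log(1 / delta)))
    return min(population, bound)


def relative_sample_size(d, eps, delta, p, population, c_prime):
    """Sample size for a relative (p, eps)-approximation.

    c_prime has no default because the text gives no value for it.
    """
    bound = math.ceil(
        c_prime / (eps ** 2 * p) * (d * math.log(1 / p) + math.log(1 / delta))
    )
    return min(population, bound)
```

Under these assumptions, with $d = 10$ and $\varepsilon = \delta = 0.1$ the absolute bound evaluates to a few hundred points, independently of the population size once the population is larger than the bound.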
3 The Dataset's Range Space and its VC-dimension 



We define one range space that is used in the derivation of the sample sizes needed to approximate the solutions to the tasks of market 
basket analysis. 

Definition 11. Let $\mathcal{D}$ be a dataset of transactions that are subsets of a ground set $\mathcal{I}$. We define $S = (X, R)$ to be a range space 
associated with $\mathcal{D}$ such that: 

1. $X = \mathcal{D}$ is the set of transactions in the dataset. 

2. $R = \{T_{\mathcal{D}}(W) \mid W \subseteq \mathcal{I}\}$ is a family of sets of transactions such that for each itemset $W \subseteq \mathcal{I}$, the set $T_{\mathcal{D}}(W) = \{\tau \in \mathcal{D} \mid W \subseteq 
\tau\}$ of all transactions containing $W$ is an element of $R$. 

Theorem 2. Let $\mathcal{D}$ be a dataset and let $S = (X, R)$ be the associated range space. Let $d \in \mathbb{N}$. Then $\mathrm{VC}(S) \ge d$ if and only if there 
exists a set $A \subseteq \mathcal{D}$ of $d$ transactions from $\mathcal{D}$ such that for each subset $B \subseteq A$ of $A$, there exists an itemset $I_B$ such that: 

1. all transactions in $B$ contain $I_B$, and 

2. no transaction $\rho \in A \setminus B$ contains $I_B$. 

Proof. "$\Leftarrow$". From the definition of $I_B$, we have that $T_{\mathcal{D}}(I_B) \cap A = B$. By definition of $P_R(A)$ this means that $B \in P_R(A)$, for any 
subset $B$ of $A$. Then $P_R(A) = 2^A$, which implies $\mathrm{VC}(S) \ge d$. 

"$\Rightarrow$". Let $\mathrm{VC}(S) \ge d$. Then by definition of VC-dimension there is a set $A \subseteq \mathcal{D}$ of $d$ transactions from $\mathcal{D}$ such that $P_R(A) = 2^A$. 
By definition of $P_R(A)$, this means that for each subset $B \subseteq A$ there exists an itemset $I_B$ such that $T_{\mathcal{D}}(I_B) \cap A = B$. We want to 
show that no transaction $\rho \in A \setminus B$ contains $I_B$. Assume now by contradiction that there is a transaction $\rho^* \in A \setminus B$ containing $I_B$. 
Then $\rho^* \in T_{\mathcal{D}}(I_B)$ and, given that $\rho^* \in A$, we have $\rho^* \in T_{\mathcal{D}}(I_B) \cap A$. But by construction, we have that $T_{\mathcal{D}}(I_B) \cap A = B$ and 
$\rho^* \notin B$ because $\rho^* \in A \setminus B$. Then we have a contradiction, and there cannot be such a transaction $\rho^*$. □ 

Corollary 1. Let $\mathcal{D}$ be a dataset and $S = (\mathcal{D}, R)$ be the corresponding range space. Then, the VC-dimension $\mathrm{VC}(S)$ of $S$ is the 
maximum integer $d$ such that there is a set $A \subseteq \mathcal{D}$ of $d$ transactions from $\mathcal{D}$ such that for each subset $B \subseteq A$ of $A$, there exists an 
itemset $I_B$ such that 

1. all transactions in $B$ contain $I_B$, and 

2. no transaction $\rho \in A \setminus B$ contains $I_B$. 
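
The criterion of Theorem 2 and Corollary 1 can be tested by brute force on very small inputs. The sketch below relies on an observation that is not stated in the text and should be read as an assumption of this example: if any itemset $I_B$ works for a subset $B$, then the intersection of all transactions in $B$ also works, so only that one itemset needs to be tested. For the empty subset the sketch uses the whole ground set as the candidate. The function name is hypothetical.

```python
def shattered_by_itemsets(subset, ground_set):
    """Check the criterion of Theorem 2 for a list of transactions `subset`.

    Assumption of this sketch: for each B it suffices to test the largest
    itemset common to all transactions of B (the ground set when B is empty).
    """
    subset = [frozenset(t) for t in subset]
    ground = frozenset(ground_set)
    n = len(subset)
    for mask in range(2 ** n):
        inside = [subset[i] for i in range(n) if (mask >> i) & 1]
        outside = [subset[i] for i in range(n) if not (mask >> i) & 1]
        common = ground
        for t in inside:
            common &= t
        # The itemset must not appear in any transaction outside B.
        if any(common <= t for t in outside):
            return False
    return True
```

The running time is exponential in the number of transactions tested, which is in line with the remark that computing the exact VC-dimension is expensive.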

Computing the exact VC-dimension of a dataset is extremely expensive from a computational point of view. This does not come 
as a surprise, as it is known that computing the VC-dimension of a range space $(X, R)$ can take time $O(|R||X|^{\log |R|})$ [24, Thm. 4.1]. 
It is instead possible to give an upper bound to the VC-dimension of a dataset, and a procedure to efficiently compute the bound. 

Definition 12. Let $\mathcal{D}$ be a dataset. The d-index of a dataset is defined as the maximum integer $d$ such that $\mathcal{D}$ contains at least $d$ 
transactions of length at least $d$. 

A note of folklore: if the dataset represents the scientific publications of a given scientist, with transactions corresponding to 
articles and items in a transaction corresponding to the citations received by the paper, then the d-index of the dataset corresponds to 
the h-index of the scientist. 

The d-index $d$ of a dataset $\mathcal{D}$ can be computed in one scan of the dataset and with total memory $O(d)$. The scan starts with $d^* = 1$ 
and it stores the length of the first transaction. At any given step the procedure stores $d^*$, the current estimate of $d$, computed as the 
maximum $d'$ such that the scan up to this step found at least $d'$ transactions with length at least $d'$, and keeps a list of the sizes 
of the transactions longer than $d'$ found so far. There can be no more than $d'$ such transactions. As the scan proceeds, the procedure 
updates $d^*$ and the list of transaction sizes greater than $d^*$. 
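
A one-pass computation of the d-index can be sketched as follows. This is a heap-based variant written for illustration and is not necessarily the exact bookkeeping described above: it keeps a min-heap of at most $d^*$ transaction lengths, so its memory use is $O(d)$. The function name is hypothetical.

```python
import heapq


def d_index_one_pass(transactions):
    """Compute the d-index in a single scan using O(d) memory.

    Invariant: every length stored in the heap is at least the heap size,
    so the heap size is the current estimate d* of the d-index.
    """
    heap = []
    for t in transactions:
        length = len(t)
        if length > len(heap):
            heapq.heappush(heap, length)
            # If the shortest stored length no longer supports the larger
            # estimate, discard it and keep the previous estimate.
            if heap[0] < len(heap):
                heapq.heappop(heap)
    return len(heap)
```

For instance, a dataset with transactions of lengths 3, 2 and 1 has d-index 2, since it contains two transactions of length at least 2 but not three of length at least 3.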

The d-index is an upper bound to the VC-dimension of a dataset. 

Theorem 3. Let $\mathcal{D}$ be a dataset with d-index $d$. Then the range space $S = (X, R)$ corresponding to $\mathcal{D}$ has VC-dimension at most $d$. 

Proof. Let $\ell > d$ and assume that $S$ has VC-dimension $\ell$. From Def. 9 there is a set $\mathcal{K}$ of $\ell$ transactions that is shattered by $R$. 
By definition of $d$ and $\ell$, $\mathcal{K}$ must contain a transaction $\tau$ such that $|\tau| \le d$. The transaction $\tau$ is a member of $2^{\ell-1}$ subsets of $\mathcal{K}$. 
We denote these subsets of $\mathcal{K}$ containing $\tau$ as $A_\tau^{(i)}$, $1 \le i \le 2^{\ell-1}$, labeling them in an arbitrary order. Since $\mathcal{K}$ is shattered (i.e., 
$P_R(\mathcal{K}) = 2^{\mathcal{K}}$), we have 

$A_\tau^{(i)} \in P_R(\mathcal{K}),\ 1 \le i \le 2^{\ell-1}$. 

From the above and the definition of $P_R(\mathcal{K})$, it follows that for each set of transactions $A_\tau^{(i)}$ there must be a non-empty itemset 
$B_\tau^{(i)}$ such that 

$T_{\mathcal{D}}(B_\tau^{(i)}) \cap \mathcal{K} = A_\tau^{(i)} \in P_R(\mathcal{K})$. (3) 

Since the $A_\tau^{(i)}$ are all different from each other, this means that the $T_{\mathcal{D}}(B_\tau^{(i)})$ are all different from each other, which in turn requires 
that the $B_\tau^{(i)}$ be all different from each other, for $1 \le i \le 2^{\ell-1}$. 

Since $\tau \in A_\tau^{(i)}$ and $\tau \in \mathcal{K}$ by construction, it follows from (3) that 

$\tau \in T_{\mathcal{D}}(B_\tau^{(i)}),\ 1 \le i \le 2^{\ell-1}$. 

From the above and the definition of $T_{\mathcal{D}}(B_\tau^{(i)})$, we get that all the itemsets $B_\tau^{(i)}$, $1 \le i \le 2^{\ell-1}$, appear in the transaction $\tau$. But 
$|\tau| \le d < \ell$, therefore $\tau$ can only contain at most $2^d - 1 < 2^{\ell-1}$ non-empty itemsets, while there are $2^{\ell-1}$ different itemsets $B_\tau^{(i)}$. 
This is a contradiction, therefore our assumption is false and $\mathcal{K}$ cannot be shattered by $R$, which implies that $\mathrm{VC}(S) \le d$. □ 

This bound is tight, i.e., there are indeed datasets with VC-dimension exactly $d$, as formalized by the following Theorem. 

Theorem 4. There exists a dataset $\mathcal{D}$ with d-index $d$ such that the corresponding range space has VC-dimension exactly $d$. 

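Proof. For $d = 1$, $\mathcal{D}$ can be any dataset with transactions of length 1. Let $\tau$ be any transaction in $\mathcal{D}$ and let $a$ be the item in $\tau$. The 
set $\{\tau\} \subseteq \mathcal{D}$ is shattered because $T_{\mathcal{D}}(a) \cap \{\tau\} = \{\tau\}$ and $T_{\mathcal{D}}(b) \cap \{\tau\} = \emptyset$ for an item $b \ne a$. 

Without loss of generality, let the ground set $\mathcal{I}$ be $\mathbb{N}$. For a fixed $d > 1$, let $\tau_i = \{0, 1, 2, \dots, i-1, i+1, \dots, d\}$, $1 \le i \le d$, and 
consider the set of $d$ transactions $\mathcal{K} = \{\tau_i, 1 \le i \le d\}$. Note that $|\tau_i| = d$ and $|\mathcal{K}| = d$. 

$\mathcal{D}$ is a dataset containing $\mathcal{K}$ and any number of arbitrary transactions from $2^{\mathcal{I}}$ of length at most $d$. Let $S = (X, R)$ be the range 
space corresponding to $\mathcal{D}$. We now show that $\mathcal{K} \subseteq X$ is shattered by ranges from $R$, which implies $\mathrm{VC}(S) \ge d$. 

For each $A \in 2^{\mathcal{K}} \setminus \{\mathcal{K}, \emptyset\}$, let $Y_A$ be the itemset 

$Y_A = \{1, \dots, d\} \setminus \{i : \tau_i \in A\}$. 

Let $Y_{\mathcal{K}} = \{0\}$ and let $Y_{\emptyset} = \{d + 1\}$. By construction we have 

$T_{\mathcal{K}}(Y_A) = A,\ \forall A \in 2^{\mathcal{K}}$, 

i.e., the itemset $Y_A$ appears in all transactions in $A \subseteq \mathcal{K}$ but not in any transaction from $\mathcal{K} \setminus A$, $\forall A \in 2^{\mathcal{K}}$. This means that 

$T_{\mathcal{D}}(Y_A) \cap \mathcal{K} = T_{\mathcal{K}}(Y_A) = A,\ \forall A \in 2^{\mathcal{K}}$. 

Since $\forall A \in 2^{\mathcal{K}}$, $T_{\mathcal{D}}(Y_A) \in R$ by construction, the above implies that 

$A \in P_R(\mathcal{K}),\ \forall A \in 2^{\mathcal{K}}$. 

This means that $\mathcal{K}$ is shattered by $R$, hence $\mathrm{VC}(S) \ge d$. From this and Thm. 3 we can conclude that $\mathrm{VC}(S) = d$. □ 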

The datasets built in the proof of Thm. 4 are extremely artificial. Our experiments suggest that the VC-dimension of real datasets 
is usually much smaller than the upper bound presented in Thm. 3. 
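
The construction used in the proof of Thm. 4 can be checked numerically for small $d$. The following Python sketch builds the transactions $\tau_i$ and the itemsets $Y_A$ as described in the proof and verifies that each $Y_A$ is contained in exactly the transactions of $A$. The function names are hypothetical and the check is a sanity test, not a replacement for the proof.

```python
from itertools import combinations


def build_k(d):
    """Transactions tau_i = {0, ..., d} minus {i}, for i = 1, ..., d (d > 1)."""
    return {i: frozenset(range(d + 1)) - {i} for i in range(1, d + 1)}


def witness_itemset(d, chosen):
    """Itemset Y_A for the subset A = {tau_i : i in chosen}."""
    chosen = set(chosen)
    if len(chosen) == d:
        return frozenset({0})        # Y_K = {0}
    if not chosen:
        return frozenset({d + 1})    # Y_empty = {d + 1}
    return frozenset(range(1, d + 1)) - chosen


def check_construction(d):
    """True if every Y_A is contained in exactly the transactions of A."""
    k = build_k(d)
    for size in range(d + 1):
        for chosen in combinations(range(1, d + 1), size):
            y = witness_itemset(d, chosen)
            containing = {i for i, t in k.items() if y <= t}
            if containing != set(chosen):
                return False
    return True
```

The check enumerates all $2^d$ subsets, so it is only practical for small values of $d$.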

4 Mining (top-K) Frequent Itemsets and Association Rules 

We apply the VC-dimension results to constructing efficient sampling algorithms with performance guarantees for approximating the 
collections of FI's, top-K FI's and AR's. 

4.1 Mining Frequent Itemsets 

We construct bounds for the sample size needed to obtain relative/absolute $\varepsilon$-close approximations to the collection of FI's. The 
algorithms to compute the approximations use a standard exact FI's mining algorithm on the sample, with an appropriately adjusted 
minimum frequency threshold, as formalized in the following lemma. 

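Lemma 1. Let $\mathcal{D}$ be a dataset with transactions built on a ground set $\mathcal{I}$, and let $d$ be the d-index of $\mathcal{D}$. Let $0 < \varepsilon, \delta < 1$. Let $\mathcal{S}$ be
a random sample of $\mathcal{D}$ with size $|\mathcal{S}| = \min\{|\mathcal{D}|, \frac{4c}{\varepsilon^2}(d + \log\frac{1}{\delta})\}$, for some absolute constant $c$. Then $\mathsf{FI}(\mathcal{S}, \mathcal{I}, \theta - \varepsilon/2)$ is an absolute
$\varepsilon$-close approximation to $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$ with probability at least $1 - \delta$.

Proof. Suppose that $\mathcal{S}$ is an $\varepsilon/2$-approximation of the range space $(X, R)$ corresponding to $\mathcal{D}$. From Thm. 1 we know that this happens
with probability at least $1 - \delta$. This means that $\forall X \in 2^{\mathcal{I}}$, $f_{\mathcal{S}}(X) \in [f_{\mathcal{D}}(X) - \varepsilon/2, f_{\mathcal{D}}(X) + \varepsilon/2]$. This holds in particular for the itemsets
in $C = \mathsf{FI}(\mathcal{S}, \mathcal{I}, \theta - \varepsilon/2)$, which therefore satisfies property 3 from Def. 4. It also means that $\forall X \in \mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$, $f_{\mathcal{S}}(X) \ge \theta - \varepsilon/2$, so $C$
also guarantees property 1 from Def. 4. Let now $Y \in 2^{\mathcal{I}}$ be such that $f_{\mathcal{D}}(Y) < \theta - \varepsilon$. Then, for the properties of $\mathcal{S}$, $f_{\mathcal{S}}(Y) < \theta - \varepsilon/2$,
i.e., $Y \notin C$, which allows us to conclude that $C$ also has property 2 from Def. 4. □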
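
The quantities in Lemma 1 are simple to compute. The sketch below is illustrative only and rests on three assumptions that are not stated in the text above: the sample size has the form $\frac{4c}{\varepsilon^2}(d + \log\frac{1}{\delta})$, the logarithm is natural, and the result is rounded up to an integer. The function names are hypothetical.

```python
import math


def absolute_fi_sample_size(d, eps, delta, dataset_size, c=0.5):
    """Sample size for an absolute eps-close approximation of the FI's.

    Assumes the bound (4c / eps^2) * (d + ln(1 / delta)), capped at |D|.
    """
    bound = 4.0 * c / eps ** 2 * (d + math.log(1.0 / delta))
    return min(dataset_size, math.ceil(bound))


def lowered_threshold(theta, eps):
    """Minimum frequency threshold to use when mining the sample."""
    return theta - eps / 2.0
```

For example, `absolute_fi_sample_size(134, 0.01, 0.1, 10**9)` evaluates the bound for a d-index of 134 with $\varepsilon = 0.01$ and $\delta = 0.1$. The threshold $\theta$ does not appear among the arguments of the sample size function.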
One very interesting consequence of this result is that we do not need to know the minimum frequency threshold $\theta$ in advance to
build the sample: the properties of the $\varepsilon$-approximation allow us to use the same sample for different thresholds,
i.e., the sample does not need to be rebuilt if we want to mine it with a threshold $\theta$ first and with another threshold $\theta'$ later.

It is important to note that the VC-dimension of a dataset, and therefore the sample size from (2) needed to probabilistically obtain
an $\varepsilon$-approximation, is independent of the size (number of transactions) of $\mathcal{D}$ and also of the size of $\mathsf{FI}(\mathcal{S}, \mathcal{I}, \theta)$. It only depends
on the quantity $d$, which is always less than or equal to the length of the longest transaction in the dataset, which in turn is less than or equal to
the number of different items.
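
The d-index is the maximum integer $d$ such that the dataset contains at least $d$ transactions of length at least $d$. A minimal sketch of how it could be computed follows; the function name `d_index` is hypothetical, and sorting is used for simplicity rather than efficiency.

```python
def d_index(transactions):
    """Largest d such that at least d transactions have length at least d."""
    lengths = sorted((len(set(t)) for t in transactions), reverse=True)
    d = 0
    for rank, length in enumerate(lengths, start=1):
        if length >= rank:
            d = rank   # the `rank` longest transactions all have length >= rank
        else:
            break
    return d
```

By construction the returned value never exceeds the length of the longest transaction.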

To obtain a relative $\varepsilon$-close approximation, we need to add a dependency on $\theta$ as shown in the following Lemma.

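Lemma 2. Let $\mathcal{D}$, $d$, $\varepsilon$, and $\delta$ be as in Lemma 1. Let $\mathcal{S}$ be a random sample of $\mathcal{D}$ with size

$|\mathcal{S}| = \min\left\{|\mathcal{D}|, \frac{4(2+\varepsilon)c}{\varepsilon^2\theta(2-\varepsilon)}\left(d\log\frac{2+\varepsilon}{\theta(2-\varepsilon)} + \log\frac{1}{\delta}\right)\right\},$

for some absolute constant $c$. Then $\mathsf{FI}(\mathcal{S}, \mathcal{I}, (1 - \varepsilon/2)\theta)$ is a relative $\varepsilon$-close approximation to $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$ with probability at
least $1 - \delta$.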

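Proof. Let $p = \theta\frac{2-\varepsilon}{2+\varepsilon}$. From Thm. 1, the sample $\mathcal{S}$ is a relative $(p, \varepsilon/2)$-approximation of the range space associated to $\mathcal{D}$ with
probability at least $1 - \delta$. For any itemset $X$ in $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$, we have $f_{\mathcal{D}}(X) \ge \theta > p$, so $f_{\mathcal{S}}(X) \ge (1 - \varepsilon/2)f_{\mathcal{D}}(X) \ge (1 - \varepsilon/2)\theta$,
which implies $X \in \mathsf{FI}(\mathcal{S}, \mathcal{I}, (1 - \varepsilon/2)\theta)$, so property 1 from Def. 4 holds. Let now $X$ be an itemset with $f_{\mathcal{D}}(X) < (1 - \varepsilon)\theta$. From
our choice of $p$, we always have $p > (1 - \varepsilon)\theta$, so $f_{\mathcal{S}}(X) < p(1 + \varepsilon/2) = \theta(1 - \varepsilon/2)$. This means $X \notin \mathsf{FI}(\mathcal{S}, \mathcal{I}, (1 - \varepsilon/2)\theta)$,
as requested by property 2 from Def. 4. Since $(1 - \varepsilon/2)\theta = p(1 + \varepsilon/2)$, it follows that only itemsets $X$ with $f_{\mathcal{D}}(X) \ge p$ can be in
$\mathsf{FI}(\mathcal{S}, \mathcal{I}, (1 - \varepsilon/2)\theta)$. For these itemsets it holds $|f_{\mathcal{S}}(X) - f_{\mathcal{D}}(X)| \le \frac{\varepsilon}{2}f_{\mathcal{D}}(X)$, as requested by property 3 from Def. 4. □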
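
The two facts about $p$ used in the proof, namely $p > (1-\varepsilon)\theta$ and $p(1+\varepsilon/2) = (1-\varepsilon/2)\theta$, can be verified numerically. The sketch below is illustrative and makes several assumptions: $p = \theta\frac{2-\varepsilon}{2+\varepsilon}$, a relative $(p, \varepsilon/2)$-approximation needs $\frac{4c}{\varepsilon^2 p}(d\log\frac{1}{p} + \log\frac{1}{\delta})$ samples, logarithms are natural, and the same constant value is used for $c$ as in the absolute case. Function names are hypothetical.

```python
import math


def relative_fi_parameters(theta, eps):
    """Return (p, mining threshold) for a relative eps-close approximation."""
    p = theta * (2.0 - eps) / (2.0 + eps)
    mining_threshold = (1.0 - eps / 2.0) * theta
    return p, mining_threshold


def relative_fi_sample_size(d, theta, eps, delta, dataset_size, c=0.5):
    """Assumed bound (4c / (eps^2 p)) * (d ln(1/p) + ln(1/delta)), capped at |D|."""
    p, _ = relative_fi_parameters(theta, eps)
    bound = 4.0 * c / (eps ** 2 * p) * (d * math.log(1.0 / p) + math.log(1.0 / delta))
    return min(dataset_size, math.ceil(bound))
```

Unlike the absolute case, the sample size here depends on $\theta$ and grows as $\theta$ decreases.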



4.2 Mining Top-K Frequent Itemsets



Given the equivalence $\mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K) = \mathsf{FI}(\mathcal{D}, \mathcal{I}, f_{\mathcal{D}}^{(K)})$, we could use the above FI's sampling algorithms if we had a good
approximation of the threshold frequency of the top-K FI's.

For the absolute $\varepsilon$-close approximation we first execute a standard top-K FI's mining algorithm on the sample to estimate $f_{\mathcal{D}}^{(K)}$
and then run a standard FI's mining algorithm on the same sample using a minimum frequency threshold depending on our estimate
of $f_{\mathcal{D}}^{(K)}$. Lemma 3 formalizes this intuition.

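Lemma 3. Let $\mathcal{D}$, $d$, $\varepsilon$, and $\delta$ be as in Lemma 1. Let $K$ be a positive integer. Let $\mathcal{S}$ be a random sample of $\mathcal{D}$ with size
$|\mathcal{S}| = \min\{|\mathcal{D}|, \frac{16c}{\varepsilon^2}(d + \log\frac{1}{\delta})\}$, for some absolute constant $c$. Then $\mathsf{FI}(\mathcal{S}, \mathcal{I}, f_{\mathcal{S}}^{(K)} - \varepsilon/2)$ is an absolute $\varepsilon$-close approximation to
$\mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K)$ with probability at least $1 - \delta$.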

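Proof. Suppose that $\mathcal{S}$ is an $\varepsilon/4$-approximation of the range space $(X, R)$ corresponding to $\mathcal{D}$. From Thm. 1 we know that this happens
with probability at least $1 - \delta$. This means that $\forall Y \in 2^{\mathcal{I}}$, $f_{\mathcal{S}}(Y) \in [f_{\mathcal{D}}(Y) - \varepsilon/4, f_{\mathcal{D}}(Y) + \varepsilon/4]$. Consider now $f_{\mathcal{S}}^{(K)}$, the frequency of the
K-th most frequent itemset in the sample. Clearly, $f_{\mathcal{S}}^{(K)} \ge f_{\mathcal{D}}^{(K)} - \varepsilon/4$, because there are at least $K$ itemsets (for example any subset of
size $K$ of $\mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K)$) with frequency in the sample at least $f_{\mathcal{D}}^{(K)} - \varepsilon/4$. On the other hand $f_{\mathcal{S}}^{(K)} \le f_{\mathcal{D}}^{(K)} + \varepsilon/4$, because there cannot
be $K$ itemsets with a frequency in the sample greater than $f_{\mathcal{D}}^{(K)} + \varepsilon/4$: only itemsets with frequency in the dataset strictly greater than
$f_{\mathcal{D}}^{(K)}$ can have a frequency in the sample greater than $f_{\mathcal{D}}^{(K)} + \varepsilon/4$, and there are at most $K - 1$ such itemsets. Let now $\eta = f_{\mathcal{S}}^{(K)} - \varepsilon/2$, and
consider $\mathsf{FI}(\mathcal{S}, \mathcal{I}, \eta)$. We have $\eta \le f_{\mathcal{D}}^{(K)} - \varepsilon/4$, so for the properties of $\mathcal{S}$, $\mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K) = \mathsf{FI}(\mathcal{D}, \mathcal{I}, f_{\mathcal{D}}^{(K)}) \subseteq \mathsf{FI}(\mathcal{S}, \mathcal{I}, \eta)$, which then
guarantees property 1 from Def. 4. On the other hand, let $Y$ be an itemset such that $f_{\mathcal{D}}(Y) < f_{\mathcal{D}}^{(K)} - \varepsilon$. Then $f_{\mathcal{S}}(Y) < f_{\mathcal{D}}^{(K)} - \frac{3}{4}\varepsilon \le \eta$,
so $Y \notin \mathsf{FI}(\mathcal{S}, \mathcal{I}, \eta)$, corresponding to property 2 from Def. 4. Property 3 from Def. 4 follows from the properties of $\mathcal{S}$. □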
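
The two-step procedure behind Lemma 3 can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: a brute-force enumeration of all itemsets stands in for a standard top-K or FI mining algorithm and is only viable for very small transactions, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def itemset_frequencies(sample):
    """Frequency in the sample of every non-empty itemset (brute force)."""
    counts = Counter()
    for transaction in sample:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(sample)
    return {itemset: count / n for itemset, count in counts.items()}


def approximate_top_k(sample, k, eps):
    """Mine the sample at threshold eta = f_S^(K) - eps/2."""
    freqs = itemset_frequencies(sample)
    ranked = sorted(freqs.values(), reverse=True)
    if k > len(ranked):
        raise ValueError("fewer than k itemsets appear in the sample")
    f_k = ranked[k - 1]          # frequency of the K-th most frequent itemset
    eta = f_k - eps / 2.0        # lowered threshold from Lemma 3
    return {x: f for x, f in freqs.items() if f >= eta}
```

The returned collection may contain more than $K$ itemsets, since every itemset whose sample frequency is at least $\eta$ is kept.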

Note that as in the case of the sample size required for an absolute $\varepsilon$-close approximation to $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$, we do not need to know
$K$ in advance to compute the sample size for obtaining an absolute $\varepsilon$-close approximation to $\mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K)$.

Two different samples are needed for computing a relative $\varepsilon$-close approximation to $\mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K)$, the first one to compute a
lower bound to $f_{\mathcal{D}}^{(K)}$, the second to extract the approximation. Details for this case are presented in Lemma 4.

Lemma 4. Let $\mathcal{D}$, $d$, $\varepsilon$, and $\delta$ be as in Lemma 1. Let $K$ be a positive integer. Let $\delta_1, \delta_2$ be two reals such that $(1 - \delta_1)(1 - \delta_2) \ge (1 - \delta)$.
Let Si be a random sample ofD with some size \Si\ = ^ + log for some (j> > 2\/2/e and some absolute constant c. If 

Z^^' > then let p — and let S2 be a random sample ofD of size min{|I?|, ^(dlog ^ + log j)}/or some absolute 

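constant $c$. Then $\mathsf{FI}(\mathcal{S}_2, \mathcal{I}, (1 - \varepsilon/2)(f_{\mathcal{S}_1}^{(K)} - \varepsilon/\sqrt{2\phi}))$ is a relative $\varepsilon$-close approximation to $\mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K)$ with probability at
least $1 - \delta$.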
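Proof. Assume that $\mathcal{S}_1$ is an $\varepsilon/\sqrt{2\phi}$-approximation for $\mathcal{D}$ and $\mathcal{S}_2$ is a relative $(p, \varepsilon/2)$-approximation for $\mathcal{D}$. The probability of these two
events happening at the same time is at least $1 - \delta$, from Thm. 1.

Following the steps of the proof of Lemma 3 we can easily get that, from the properties of $\mathcal{S}_1$,

$f_{\mathcal{D}}^{(K)} - \frac{\varepsilon}{\sqrt{2\phi}} \le f_{\mathcal{S}_1}^{(K)} \le f_{\mathcal{D}}^{(K)} + \frac{\varepsilon}{\sqrt{2\phi}}.$ (4)

Consider now an element $X \in \mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K)$. We have by definition $f_{\mathcal{D}}(X) \ge f_{\mathcal{D}}^{(K)} \ge f_{\mathcal{S}_1}^{(K)} - \frac{\varepsilon}{\sqrt{2\phi}} \ge p$, and from
the properties of $\mathcal{S}_2$, it follows that $f_{\mathcal{S}_2}(X) \ge (1 - \varepsilon/2)f_{\mathcal{D}}(X) \ge (1 - \varepsilon/2)(f_{\mathcal{S}_1}^{(K)} - \frac{\varepsilon}{\sqrt{2\phi}})$, which implies
$X \in \mathsf{FI}(\mathcal{S}_2, \mathcal{I}, (1 - \varepsilon/2)(f_{\mathcal{S}_1}^{(K)} - \varepsilon/\sqrt{2\phi}))$ and therefore property 1 from Def. 4 holds for $\mathsf{FI}(\mathcal{S}_2, \mathcal{I}, \eta)$, where $\eta$ denotes this threshold.

Let now $Y$ be an itemset such that $f_{\mathcal{D}}(Y) < (1 - \varepsilon)f_{\mathcal{D}}^{(K)}$. From our choice of $p$ we have that $f_{\mathcal{D}}(Y) < p$. Then $f_{\mathcal{S}_2}(Y) <
(1 + \varepsilon/2)p \le (1 - \varepsilon/2)(f_{\mathcal{S}_1}^{(K)} - \frac{\varepsilon}{\sqrt{2\phi}})$. Therefore, $Y \notin \mathsf{FI}(\mathcal{S}_2, \mathcal{I}, \eta)$ and property 2 from Def. 4 is guaranteed.

Property 3 from Def. 4 follows from (4) and the properties of $\mathcal{S}_2$. □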

4.3 Mining Association Rules 

Our final theoretical contribution concerns the discovery of relative/absolute approximations to $\mathsf{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$ from a sample.
Lemma 5 builds on a result from [6, Sect. 5] and covers the relative case, while Lemma 6 deals with the absolute one.

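Lemma 5. Let $0 < \delta, \varepsilon, \theta, \gamma < 1$, $\phi = \max\{3, 2 - \varepsilon + 2\sqrt{1 - \varepsilon}\}$, $\eta = \frac{\varepsilon}{\phi}$, and $p = \theta\frac{1-\eta}{1+\eta}$. Let $\mathcal{D}$ be a dataset with d-index $d$. Let $\mathcal{S}$
be a random sample of $\mathcal{D}$ of size $\min\{|\mathcal{D}|, \frac{c}{\eta^2 p}(d\log\frac{1}{p} + \log\frac{1}{\delta})\}$ for some absolute constant $c$. Then $\mathsf{AR}(\mathcal{S}, \mathcal{I}, (1 - \eta)\theta, \frac{1-\eta}{1+\eta}\gamma)$ is a
relative $\varepsilon$-close approximation to $\mathsf{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$ with probability at least $1 - \delta$.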

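Proof. Suppose $\mathcal{S}$ is a relative $(p, \eta)$-approximation for the range space corresponding to $\mathcal{D}$. From Thm. 1 we know this happens
with probability at least $1 - \delta$.

Let $W \in \mathsf{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$ be the association rule "$A \Rightarrow B$", where $A$ and $B$ are itemsets. By definition $f_{\mathcal{D}}(W) = f_{\mathcal{D}}(A \cup B) \ge
\theta > p$. From this and the properties of $\mathcal{S}$, we get

$f_{\mathcal{S}}(W) = f_{\mathcal{S}}(A \cup B) \ge (1 - \eta)f_{\mathcal{D}}(A \cup B) \ge (1 - \eta)\theta.$

Note that, from the fact that $f_{\mathcal{D}}(W) = f_{\mathcal{D}}(A \cup B) \ge \theta$, it follows that $f_{\mathcal{D}}(A), f_{\mathcal{D}}(B) \ge \theta > p$, for the anti-monotonicity
property of the frequency of itemsets.

By definition, $c_{\mathcal{D}}(W) = \frac{f_{\mathcal{D}}(W)}{f_{\mathcal{D}}(A)} \ge \gamma$. Then,

$c_{\mathcal{S}}(W) = \frac{f_{\mathcal{S}}(W)}{f_{\mathcal{S}}(A)} \ge \frac{(1-\eta)f_{\mathcal{D}}(W)}{(1+\eta)f_{\mathcal{D}}(A)} = \frac{1-\eta}{1+\eta} \cdot \frac{f_{\mathcal{D}}(W)}{f_{\mathcal{D}}(A)} \ge \frac{1-\eta}{1+\eta}\gamma.$

It follows that $W \in \mathsf{AR}(\mathcal{S}, \mathcal{I}, (1 - \eta)\theta, \frac{1-\eta}{1+\eta}\gamma)$, hence property 1 from Def. 5 is satisfied.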

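Let now $Z$ be the association rule "$C \Rightarrow D$", such that $f_{\mathcal{D}}(Z) = f_{\mathcal{D}}(C \cup D) < (1 - \varepsilon)\theta$. But from our definitions of $\eta$ and $p$, it
follows that $f_{\mathcal{D}}(Z) < p < \theta$, hence $f_{\mathcal{S}}(Z) < (1 + \eta)p \le (1 - \eta)\theta$, and therefore $Z \notin \mathsf{AR}(\mathcal{S}, \mathcal{I}, (1 - \eta)\theta, \frac{1-\eta}{1+\eta}\gamma)$, as requested by
property 2 from Def. 5.

Consider now an association rule $Y$ = "$E \Rightarrow F$" such that $c_{\mathcal{D}}(Y) < (1 - \varepsilon)\gamma$. Clearly, we are only concerned with $Y$ such that
$f_{\mathcal{D}}(Y) \ge p$, otherwise we just showed that $Y$ cannot be in $\mathsf{AR}(\mathcal{S}, \mathcal{I}, (1 - \eta)\theta, \frac{1-\eta}{1+\eta}\gamma)$. From this and the anti-monotonicity property,
it follows that $f_{\mathcal{D}}(E), f_{\mathcal{D}}(F) \ge p$. Then,

$c_{\mathcal{S}}(Y) = \frac{f_{\mathcal{S}}(Y)}{f_{\mathcal{S}}(E)} \le \frac{(1+\eta)f_{\mathcal{D}}(Y)}{(1-\eta)f_{\mathcal{D}}(E)} < \frac{1+\eta}{1-\eta}(1 - \varepsilon)\gamma \le \frac{1-\eta}{1+\eta}\gamma,$

where the last inequality follows from the fact that $(1 - \eta)^2 \ge (1 + \eta)^2(1 - \varepsilon)$ for our choice of $\eta$. We can conclude that $Y \notin
\mathsf{AR}(\mathcal{S}, \mathcal{I}, (1 - \eta)\theta, \frac{1-\eta}{1+\eta}\gamma)$ and therefore property 4 from Def. 5 holds.

Properties 3 and 5 from Def. 5 follow from the above steps (i.e., what association rules can be in the approximations), from the
definition of $\phi$, and from the properties of $\mathcal{S}$. □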
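
The proof relies on two inequalities that follow from the choice of $\phi$, $\eta$ and $p$: $(1-\varepsilon)\theta \le p$ and $(1+\eta)^2(1-\varepsilon) \le (1-\eta)^2$. The sketch below checks them numerically. It is illustrative only and assumes $\eta = \varepsilon/\phi$ and $p = \theta\frac{1-\eta}{1+\eta}$; the function names are hypothetical, and a small tolerance is used because the second inequality holds with equality when $\phi = 2 - \varepsilon + 2\sqrt{1-\varepsilon}$.

```python
import math


def ar_parameters(theta, gamma, eps):
    """Return (phi, eta, p, min_freq, min_conf) under the assumed definitions."""
    phi = max(3.0, 2.0 - eps + 2.0 * math.sqrt(1.0 - eps))
    eta = eps / phi
    p = theta * (1.0 - eta) / (1.0 + eta)
    min_freq = (1.0 - eta) * theta                  # frequency threshold on the sample
    min_conf = (1.0 - eta) / (1.0 + eta) * gamma    # confidence threshold on the sample
    return phi, eta, p, min_freq, min_conf


def proof_inequalities_hold(theta, gamma, eps, tol=1e-12):
    """Check the two inequalities used in the proof of the relative case."""
    _, eta, p, _, _ = ar_parameters(theta, gamma, eps)
    freq_ok = (1.0 - eps) * theta <= p + tol
    conf_ok = (1.0 + eta) ** 2 * (1.0 - eps) <= (1.0 - eta) ** 2 + tol
    return freq_ok and conf_ok
```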

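Lemma 6. Let $\mathcal{D}$, $d$, $\theta$, $\gamma$, $\varepsilon$, and $\delta$ be as in Lemma 5 and let $\varepsilon_{\mathrm{rel}} = \frac{\varepsilon}{\max\{\theta, \gamma\}}$.
Fix $\phi = \max\{3, 2 - \varepsilon_{\mathrm{rel}} + 2\sqrt{1 - \varepsilon_{\mathrm{rel}}}\}$, $\eta = \frac{\varepsilon_{\mathrm{rel}}}{\phi}$, and $p = \theta\frac{1-\eta}{1+\eta}$. Let $\mathcal{S}$ be a random sample of $\mathcal{D}$ of size $\min\{|\mathcal{D}|, \frac{c}{\eta^2 p}(d\log\frac{1}{p} +
\log\frac{1}{\delta})\}$ for some absolute constant $c$. Then $\mathsf{AR}(\mathcal{S}, \mathcal{I}, (1 - \eta)\theta, \frac{1-\eta}{1+\eta}\gamma)$ is an absolute $\varepsilon$-close approximation to $\mathsf{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$.

Proof. The thesis follows from Lemma 5 by setting $\varepsilon$ there to $\varepsilon_{\mathrm{rel}}$. □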

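Note that the sample size needed for absolute $\varepsilon$-close approximations to $\mathsf{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$ depends on $\theta$ and $\gamma$, which was not the
case for absolute $\varepsilon$-close approximations to $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$ and $\mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K)$.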
5 Experimental Evaluation



In this section we present an extensive experimental evaluation of our methods to extract approximations of $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$, $\mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K)$,
and $\mathsf{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$. Due to space constraints, we focus on a subset of the results.

Our first goal is to evaluate the quality of the approximations obtained using our methods, by comparing the experimental results
to the analytical bounds. We also evaluate how strict the bounds are by testing whether the same quality of results can be achieved at
sample sizes smaller than those suggested by the theoretical analysis. We then show that our methods can significantly speed up the
mining process, fulfilling the motivating promises of the use of sampling in the market basket analysis tasks. Lastly, we compare the
sample sizes from our results to the best previous work [6].

We tested our methods on both real and artificial datasets. The real datasets come from the FIMI'04 repository (http://fimi.ua.ac.be/data/).
Since most of them have a moderately small size, we replicated their transactions a number of times, with
the only effect of increasing the size of the dataset but no change in the distribution of the frequencies of the itemsets. The artificial
datasets were built such that their corresponding range spaces have VC-dimension equal to the maximum transaction length $d$, which
is the maximum possible as shown in Thm. 3. To create these datasets, we followed the proof of Thm. 4 and used the generator
included in ARtool (http://www.cs.umb.edu/~laur/ARtool/), which is similar to the one presented in [2]. We used
the FP-Growth and Apriori implementations in ARtool to extract frequent itemsets and association rules.

In all our experiments we fixed $\delta = 0.1$. In the experiments involving absolute (resp. relative) $\varepsilon$-close approximations we set $\varepsilon = 0.01$ (resp. $\varepsilon = 0.05$). The
absolute constant $c$ was fixed to 0.5 as suggested by [25]. For each dataset we selected a range of minimum frequency thresholds and
a set of values for $K$ when extracting the top-K frequent itemsets. For association rules discovery we set the minimum confidence
threshold $\gamma \in \{0.5, 0.75, 0.9\}$. For each dataset and each combination of parameters we created random samples with size as
suggested by our theorems and with smaller sizes to evaluate the strictness of the bounds.

We measured, for each set of parameters,
the absolute frequency error and the absolute confidence error, defined as the error $|f_{\mathcal{D}}(X) - f_{\mathcal{S}}(X)|$ (resp. $|c_{\mathcal{D}}(Y) - c_{\mathcal{S}}(Y)|$) for
an itemset $X$ (resp. an association rule $Y$) in the approximate collection extracted from sample $\mathcal{S}$. When dealing with the problem
of extracting relative $\varepsilon$-close approximations, we defined the relative frequency error to be the absolute frequency error divided by
the real frequency of the itemset and analogously for the relative confidence error (dividing by the real confidence). In the figures we
plot the maximum and the average for these quantities, taken over all itemsets or association rules in the output collection. In order
to limit the influence of a single sample, we computed and plotted in the figures the maximum (resp. the average) of these quantities in
three runs of our methods on three different samples for each size.
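
The error measures just described can be written down directly. The following sketch is an illustration and not the code used for the experiments; the function name and the dictionary-based representation of the collections are assumptions. It returns the maximum and the average error over all itemsets in the collection extracted from the sample.

```python
def frequency_errors(true_freq, sample_freq, relative=False):
    """Maximum and average frequency error over the output collection.

    true_freq maps each itemset to its frequency in the dataset,
    sample_freq maps each itemset in the output to its sample frequency.
    """
    errors = []
    for itemset, f_s in sample_freq.items():
        f_d = true_freq[itemset]
        error = abs(f_d - f_s)       # absolute frequency error
        if relative:
            error /= f_d             # divide by the real frequency
        errors.append(error)
    if not errors:
        return 0.0, 0.0
    return max(errors), sum(errors) / len(errors)
```

The same function applies to confidence errors if the two dictionaries hold confidences of association rules instead of frequencies of itemsets.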

The first important result of our experiments is that, for all problems, for every combination of parameters and every run, the
collection of itemsets or of association rules obtained using our methods always satisfied the requirements to be an absolute or relative
$\varepsilon$-close approximation to the real collection. Thus in practice our methods indeed achieve or exceed the theoretical guarantees for
approximations of the collections $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$, $\mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K)$, and $\mathsf{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$.
Figure 1: Absolute / Relative $\varepsilon$-close Approximation to $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$



Evaluating the strictness of the bounds to the sample size was the second goal of our experiments. In Figure 1a we show the
behaviour of the maximum frequency error as a function of the sample size in the itemsets obtained from samples using the method
presented in Lemma 1 (i.e., we are looking for an absolute $\varepsilon$-close approximation to $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$). The rightmost plotted point
corresponds to the sample size suggested by the theoretical analysis. We are showing the results for the dataset BMS-POS replicated
40 times (d-index $d = 134$), mined with $\theta = 0.02$. It is clear from the picture that the guaranteed error bounds are achieved even
at sample sizes smaller than those suggested by the analysis and that the error at the sample size derived from the theory (rightmost
plotted point for each line) is one to two orders of magnitude smaller than the maximum tolerable error $\varepsilon = 0.01$. This
suggests that there is still room for improvement in the bounds to the sample size needed to achieve an absolute $\varepsilon$-close approximation
to $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$. In Fig. 1b we report similar results for the problem of computing a relative $\varepsilon$-close approximation to $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$ for an
artificial dataset whose range space has VC-dimension $d$ equal to the length of the longest transaction in the dataset, in this case 33.
The dataset contained 100 million transactions. The sample size, suggested by Lemma 2, was computed using $\theta = 0.01$, $\varepsilon = 0.05$,
and $\delta = 0.1$. The conclusions we can draw from the results for the behaviour of the relative frequency error are similar to those
we got for the absolute case. For the case of absolute and relative $\varepsilon$-close approximation to $\mathsf{TOPK}(\mathcal{D}, \mathcal{I}, K)$, we observed similar
results, which we do not report here because of space constraints.
Figure 2: Relative $\varepsilon$-close approximation to $\mathsf{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$

The results of the experiments to evaluate our method to extract a relative $\varepsilon$-close approximation to $\mathsf{AR}(\mathcal{D}, \mathcal{I}, \theta, \gamma)$ are presented
in Fig. 2a and 2b. The same observations as before hold for the relative frequency error, while it is interesting to note that the relative
confidence error is even smaller than the frequency error, most likely because the confidence of an association rule is the ratio
between the frequencies of two itemsets that appear in the same transactions and their sample frequencies will therefore have similar
errors that cancel out when the ratio is computed. Similar conclusions can be made for the absolute $\varepsilon$-close approximation case (not
reported due to space constraints).

The major motivating intuition for the use of sampling in market basket analysis tasks is that mining a sample of the dataset is
faster than mining the entire dataset. Nevertheless, the mining time does not only depend on the number of transactions, but also on
the number of frequent itemsets. Given that our methods suggest mining the sample at a lowered minimum frequency threshold, this
may cause an increase in running time that would make our method not useful in practice, because there may be many more frequent
itemsets than at the original frequency threshold. We performed a number of experiments to evaluate whether this was the case and
present the results in Fig. 3. We mined the artificial dataset introduced before for different values of $\theta$, and created samples of size
sufficient to obtain a relative $\varepsilon$-close approximation to $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$, for $\varepsilon = 0.05$ and $\delta = 0.1$. Figure 3 shows the time needed to mine
the large dataset and the time needed to create and mine the samples. Even considering the sampling
time, the speed-up achieved by our method is significant, proving the usefulness of sampling.
Figure 3: Runtime Comparison. The sample line includes the sampling time.



Comparing our results to previous work we note that the bounds generated by our technique are always linear in the VC-dimension
$d$ associated with the dataset. As reported in Table 1, the best previous work [6] presented bounds that are linear in the maximum
transaction size $\Delta$ for two of the six problems studied here. Figures 4a and 4b show a comparison of the actual sample sizes for
relative $\varepsilon$-close approximations to $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$ as functions of $\theta$ and $\varepsilon$. To compute the points for these figures, we set $\Delta = d = 50$,
corresponding to the worst possible case for our method, i.e., when the VC-dimension of the range space associated to the dataset
is exactly equal to the maximum transaction length. We also fixed $\delta = 0.05$ (the two methods behave equally as $\delta$ changes). For
Fig. 4a we fixed $\varepsilon = 0.05$, while for Fig. 4b we fixed $\theta = 0.05$. From the figures we can appreciate that both bounds have similar,
but not equal, dependencies on $\theta$ and $\varepsilon$. More precisely, the bound presented in this work is less dependent on $\varepsilon$ and only slightly
more dependent on $\theta$. It is also evident that the sample sizes suggested by the bound presented in this work are always much smaller
than those presented in [6] (the vertical axis has logarithmic scale). In this comparison we used $\Delta = d$, but almost all real datasets
we encountered have $d \ll \Delta$ as shown in Table 2, which would result in a larger gap between the sample sizes provided by the two
methods.
Figure 4: Sample sizes for relative $\varepsilon$-close approximations to $\mathsf{FI}(\mathcal{D}, \mathcal{I}, \theta)$.

Table 2: Values for maximum transaction length $\Delta$ and d-index $d$ for real datasets



6 Conclusions 

In this paper we presented a novel technique to derive random sample sizes sufficient to easily extract high-quality approximations of
the (top-K) frequent itemsets and of the collection of association rules. The sample sizes are linearly dependent on the VC-dimension
of the range space associated to the dataset, which is upper bounded by the maximum integer $d$ such that there are at least $d$ transactions of
length at least $d$ in the dataset. This bound is tight for a family of datasets. We conducted an extensive experimental evaluation which
shows the practical usefulness of our method, confirming our theoretical analysis. In the future we would like to explore possible
ways of giving a stricter upper bound to the VC-dimension for a given dataset, or whether other measures of sample complexity like
the triangular rank [29] can suggest smaller sample sizes.